library(tidyverse)
library(here)
library(knitr)
library(dplyr)
library(broom)
library(DT)
library(kableExtra)
Our dataset in this lab concerns baby names and their popularity over time. At this link, you can find the names for all 50 states, in separate datasets organized by first letter. For each year, and for each name with at least 50 recorded babies born, we are given the counts of how many babies had that name.
# Read the "A" names file from the Labs/Lab 9 folder of the project
stateNames <- read_csv(here("Labs", "Lab 9", "StateNames_A.csv"))
# Interactive preview of the full dataset
datatable(stateNames)

Let’s take a look at how the name “Allison” has changed over time. As my name begins with “A”, you should download the StateNames_A.csv dataset from the link above.
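Summarize the number of babies named “Allison” by state and sex. The table should have the following qualities:
- each state should be its own row
- each sex should have its own column
- if there were no babies born for a combination of state and sex, there should be a 0 (not an NA)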
Difference between gender and sex
The dataset has a column titled Gender, which contains two values "F" and "M", representing “Female” and “Male”. The sex someone was assigned at birth is different from their gender identity (definitions). Thus, this variable should be renamed to Sex or Sex at Birth.
# Rename Gender to Sex, since the column records sex assigned at birth
stateNames <- stateNames |>
  rename(Sex = Gender)
# Total Allisons per state and sex: one row per state, one column per sex
allison_data <- stateNames |>
  filter(Name == "Allison") |>
  group_by(State, Sex) |>
  summarize(num_babies = sum(Count)) |>
  ungroup() |>
  pivot_wider(names_from = Sex,
              values_from = num_babies,
              values_fill = 0) |>
  arrange(State)
kable(head(allison_data, 5))

| State | F | M |
|---|---|---|
| AK | 232 | 0 |
| AL | 1535 | 0 |
| AR | 1198 | 0 |
| AZ | 1880 | 0 |
| CA | 12413 | 0 |
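
The zeros in the table come from the values_fill = 0 argument of pivot_wider(), which fills in state and sex combinations that never appear in the data. As a minimal sketch of that behavior, the toy tibble below is hypothetical (the states "AA" and "BB" and their counts are made up for illustration and are not from StateNames_A.csv):

```r
library(tibble)
library(tidyr)

# Hypothetical toy data: state "BB" has no "M" row at all
toy <- tribble(
  ~State, ~Sex, ~num_babies,
  "AA",   "F",  10,
  "AA",   "M",   2,
  "BB",   "F",   7
)

# Without values_fill, the missing BB/M combination becomes NA
wide_na <- pivot_wider(toy,
                       names_from = Sex,
                       values_from = num_babies)

# With values_fill = 0, the same cell is filled with 0 instead
wide_zero <- pivot_wider(toy,
                         names_from = Sex,
                         values_from = num_babies,
                         values_fill = 0)

print(wide_na)
print(wide_zero)
```

Filling at the pivot step keeps the zero tied to "no rows for this combination", rather than replacing every NA in the table after the fact.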
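Create a new dataset named allison_f which contains only the babies assigned Female at birth.

# Keep only the state and the count of babies assigned Female at birth
allison_f <- allison_data |>
  select(State, `F`)
kable(head(allison_f, 5))

| State | F |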
|---|---|
| AK | 232 |
| AL | 1535 |
| AR | 1198 |
| AZ | 1880 |
| CA | 12413 |
Make a visualization showing how the popularity of the name “Allison” has changed over the years. To be clear, each year should have one observation: the total number of Allisons born that year.
allison_data <- stateNames |>
  filter(Name == "Allison")
# One row per year: total Allisons across all states and both sexes
allison_summary <- allison_data |>
  group_by(Year) |>
  summarize(num_babies = sum(Count)) |>
  ungroup()
ggplot(allison_summary, aes(x = Year, y = num_babies)) +
  geom_line() +
  ggtitle("Popularity of the Name Allison Over Time") +
  xlab("Year") +
  ylab("Number of Babies Named Allison")

# Fit a linear model of the yearly Allison count on year
model <- lm(num_babies ~ Year,
            data = allison_summary)
kable(tidy(model)) |>
  kable_styling(bootstrap_options = "striped")

| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | 209815.052 | 42883.24874 | 4.892704 | 0.0001626 |
| Year | -101.581 | 21.38275 | -4.750606 | 0.0002172 |
# Attach fitted values and residuals to the data used in the model
augmented <- model |>
  augment()
ggplot(data = augmented,
       mapping = aes(x = .fitted,
                     y = .resid)) +
  geom_point() +
  labs(title = "Plot of Residual vs Fitted Points of Linear Model",
       y = "Residual",
       x = "Fitted")

In middle school I was so upset with my parents for not naming me "Allyson". Past my pre-teen rebellion, I'm happy with my name and am impressed when baristas spell it "Allison" instead of "Alison". But I don't have it as bad as my good friend Allan!
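# %in% keeps a row when Name is any of the three spellings;
# == would recycle the vector down the column and silently drop rows
allans <- stateNames |>
  filter(Name %in% c("Allan", "Alan", "Allen"),
         Sex == "M") |>
  group_by(Year, Name) |>
  summarize(num_names = sum(Count))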
ggplot(data = allans,
       mapping = aes(x = Year,
                     y = num_names,
                     color = Name)) +
  geom_line() +
  labs(title = "Number of babies with the names Alan, Allan, or Allen, per year",
       y = "",
       x = "Year")

Filtering multiple values
It looks like you want to filter for a vector of values. What tools have you learned which can help you accomplish this task?
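
One way to see why this matters is to compare == with %in% on a small example. The sketch below uses a hypothetical toy tibble (the counts are made up, not taken from StateNames_A.csv), and spellings is an illustrative name introduced here:

```r
library(dplyr)
library(tibble)

# Hypothetical toy data, not taken from StateNames_A.csv
toy <- tibble(
  Name  = c("Alan", "Allan", "Allen", "Alan", "Allan", "Allen"),
  Count = c(5, 6, 7, 8, 9, 10)
)

spellings <- c("Allan", "Alan", "Allen")

# == compares element by element, recycling `spellings` down the column,
# so a row is kept only when it happens to line up with the recycled vector
kept_eq <- toy |> filter(Name == spellings)

# %in% asks, for each row, whether Name is any one of the spellings
kept_in <- toy |> filter(Name %in% spellings)

nrow(kept_eq)  # 2
nrow(kept_in)  # 6
```

Every row of the toy data is one of the three spellings, yet == keeps only the two rows that line up with the recycled vector, while %in% keeps all six.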
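For the year 2000, in California and Pennsylvania, tabulate the number of babies born with each spelling. The table should have the following qualities:
- each spelling should be its own column
- each state should have its own row
- a 0 (not an NA) should be used to represent locations where there were no instances of these names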
# Counts of each spelling in CA and PA for the year 2000, one column per spelling
allans2 <- stateNames |>
  filter(Name %in% c("Allan", "Alan", "Allen"),
         State %in% c("CA", "PA"),
         Year == 2000) |>
  group_by(State, Name) |>
  summarize(num_babies = sum(Count)) |>
  ungroup() |>
  pivot_wider(names_from = Name,
              values_from = num_babies,
              values_fill = 0)
kable(allans2)

| State | Alan | Allan | Allen |
|---|---|---|---|
| CA | 584 | 131 | 176 |
| PA | 51 | 12 | 56 |
# Total babies across the three spellings within each state
totals <- allans2 |>
  group_by(State) |>
  summarize(Total = sum(Allan, Allen, Alan))
# Convert each count to a percent of its state's total
allans2 |>
  left_join(totals, by = "State") |>
  mutate(Allan = (Allan / Total) * 100,
         Allen = (Allen / Total) * 100,
         Alan = (Alan / Total) * 100) |>
  select(-Total) |>
  kable(align = c("c", "c", "c", "c")) |>
  kable_styling(font_size = 14) |>
  add_header_above(header = c("", "%", "%", "%"))

| State | Alan | Allan | Allen |
|---|---|---|---|
| CA | 65.54433 | 14.70258 | 19.75309 |
| PA | 42.85714 | 10.08403 | 47.05882 |
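
The percentages above are each count divided by its state's row total. As an alternative sketch of the same arithmetic, the block below re-enters the year-2000 counts from the earlier table by hand and uses across() to apply the division to all three spelling columns at once; the names counts and percents are illustrative, and this is not the code that produced the table:

```r
library(dplyr)
library(tibble)

# Counts copied by hand from the year-2000 table in this lab
counts <- tribble(
  ~State, ~Alan, ~Allan, ~Allen,
  "CA",   584,   131,    176,
  "PA",    51,    12,     56
)

percents <- counts |>
  # Row total across the three spellings
  mutate(Total = Alan + Allan + Allen) |>
  # Divide every spelling column by the row total and scale to percent
  mutate(across(c(Alan, Allan, Allen), ~ .x / Total * 100)) |>
  select(-Total)

print(percents)
```

Each row of percents should sum to 100, which is a quick check that the totals were computed per state and not over the whole table.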